Large-Scale RL

Value Function Approximation

Reinforcement learning becomes practically useful when it can be applied to problems with large state spaces.

So far we have represented the value function as a table, with one entry per state (or per state-action pair). For large MDPs there are too many states and/or actions to store in memory, and learning the value of each state individually is too slow. Instead, we approximate the value function with a parametrized function V(s, w) and generalize from observed states to unseen ones.

Solution for Large MDPs

Function Approximation

There are many possible function approximators, e.g. linear combinations of features, neural networks, decision trees, and nearest-neighbour methods. We consider differentiable function approximators (such as linear combinations of features and neural networks), because we can then update the weights with gradient-based optimization. We also need training methods suitable for non-stationary, non-i.i.d. data.

Since the policy keeps changing as we learn, the data distribution is non-stationary; and since successive states within a trajectory are correlated, the data is not i.i.d.

Gradient Descent

Let J(w) be a differentiable objective function that we want to minimize with respect to the weight vector w.

Gradient of the objective function:

\nabla J(w) = \begin{bmatrix} \frac{\partial J(w)}{\partial w_1} \\ \frac{\partial J(w)}{\partial w_2} \\ \vdots \\ \frac{\partial J(w)}{\partial w_n} \end{bmatrix}

To find the minimum of the objective function, we adjust the weights in the direction of the negative gradient:

\Delta w = -\frac{1}{2}\alpha \nabla J(w)

where α is the learning rate, or step size.
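As a concrete sketch, here is this update applied to an illustrative quadratic objective J(w) = ||w||², whose gradient is 2w (the objective and step size are invented for the example):

```python
import numpy as np

def gradient_descent_step(w, grad_J, alpha=0.1):
    """One step of gradient descent: w <- w - (1/2) * alpha * grad J(w)."""
    return w - 0.5 * alpha * grad_J(w)

# Illustrative objective J(w) = ||w||^2 with gradient 2w; its minimum is w = 0.
grad_J = lambda w: 2.0 * w

w = np.array([1.0, -2.0])
for _ in range(100):
    w = gradient_descent_step(w, grad_J)
# w has moved very close to the minimizer [0, 0]
```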

Stochastic Gradient Descent

In stochastic gradient descent, we want to find the parameter vector w that minimizes the expected squared error between the approximate value and the true value:

J(w) = E_{\pi}[(V_{\pi}(s) - V(s, w))^2]

where V_{\pi}(s) is the true value of state s under policy π, and V(s, w) is the approximate value.

\Delta w = -\frac{1}{2}\alpha \nabla J(w)
= \alpha E_{\pi}[(V_{\pi}(s) - V(s, w))\nabla V(s, w)]

We cannot compute this expectation exactly, so we estimate it from a single sample:

\Delta w = \alpha (V_{\pi}(s) - V(s, w))\nabla V(s, w)
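One such stochastic update can be sketched as follows; the target, current estimate, and gradient come from whatever sample and approximator are in use (the concrete numbers below are made up):

```python
import numpy as np

def sgd_update(w, v_target, v_hat, grad_v, alpha=0.01):
    """Delta w = alpha * (V_pi(s) - V(s, w)) * grad V(s, w), from one sample."""
    return w + alpha * (v_target - v_hat) * grad_v

# One illustrative update: target 5.0, current estimate 3.0, gradient [1, 2].
w = np.array([0.0, 0.0])
w = sgd_update(w, v_target=5.0, v_hat=3.0, grad_v=np.array([1.0, 2.0]))
# w is now [0.02, 0.04]
```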

Feature Vector

We represent the state s by a feature vector x(s) = (x_1(s), \dots, x_n(s))^T: a vector of real-valued functions of the state.

For example, the features could be the distance of a robot from landmarks, trends in the stock market, or the piece and pawn configuration in chess.
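A minimal sketch of a feature map, assuming a scalar state and (hypothetical) polynomial features:

```python
import numpy as np

def poly_features(s, n=4):
    """Feature vector x(s) = (1, s, s^2, ..., s^(n-1)) for a scalar state s."""
    return np.array([s ** j for j in range(n)])

x = poly_features(0.5)
# x = [1.0, 0.5, 0.25, 0.125]
```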

Linear Function Approximation

The value function is a linear combination of the feature vector:

V(s, w) = x^T(s) w = \sum_{j=1}^{n} x_j(s) w_j

where w is the weight vector, and x(s) is the feature vector.

J(w) = E_{\pi}[(V_{\pi}(s) - x^T(s) w)^2]

The objective function is now quadratic in w, so stochastic gradient descent converges to the global optimum. The gradient and update rule become:

\nabla V(s, w) = x(s)
\Delta w = \alpha (V_{\pi}(s) - x^T(s) w) x(s)

The update has a simple form: step size × prediction error × feature value, so each weight is corrected in proportion to how strongly its feature is active in the state.
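The linear case can be sketched as follows, assuming the feature vector and target value are given (the numbers are illustrative):

```python
import numpy as np

def v_hat(x_s, w):
    """Linear value estimate V(s, w) = x(s)^T w."""
    return x_s @ w

def linear_update(w, v_target, x_s, alpha=0.1):
    """Delta w = alpha * (target - x(s)^T w) * x(s): step size * error * feature."""
    return w + alpha * (v_target - x_s @ w) * x_s

# Repeated updates on a single state pull the estimate toward the target.
x_s = np.array([1.0, 2.0])
w = np.zeros(2)
for _ in range(50):
    w = linear_update(w, v_target=3.0, x_s=x_s)
# v_hat(x_s, w) is now very close to 3.0
```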

Incremental Prediction Algorithms

In practice we do not know the true value V_{\pi}(s), so we substitute a target estimated from samples:

Monte-Carlo (MC) Approximation

Given training data consisting of states paired with their sampled returns, we apply the SGD update with the return G_t as the target:

\langle S_1, G_1 \rangle, \langle S_2, G_2 \rangle, \langle S_3, G_3 \rangle, \dots
\Delta w = \alpha(G_t - V(S_t, w)) \nabla V(S_t, w)
= \alpha(G_t - x^T(S_t) w) x(S_t)
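Gradient Monte-Carlo prediction can be sketched as follows, assuming each episode is given as per-step feature vectors and rewards (the episode data below is made up):

```python
import numpy as np

def mc_episode_update(w, features, rewards, alpha=0.05, gamma=1.0):
    """Update w once per visited state, using the return G_t as the target."""
    # Compute returns backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    for x_t, G_t in zip(features, returns):
        w = w + alpha * (G_t - x_t @ w) * x_t
    return w

# Two-step episode with one-hot features; the returns are G_1 = 3, G_2 = 2.
features = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
rewards = [1.0, 2.0]
w = np.zeros(2)
for _ in range(200):
    w = mc_episode_update(w, features, rewards)
# w approaches [3, 2], the sampled per-state returns
```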

Temporal Difference (TD) Approximation

The TD target R_{t+1} + \gamma V(S_{t+1}, w) is a biased estimate of the true return. The training data is the sequence of states paired with their TD targets:

\langle S_1, R_2 + \gamma V(S_2, w) \rangle, \langle S_2, R_3 + \gamma V(S_3, w) \rangle, \langle S_3, R_4 + \gamma V(S_4, w) \rangle, \dots

With linear function approximation, the TD(0) update rule is:

\Delta w = \alpha(R_{t+1} + \gamma V(S_{t+1}, w) - V(S_t, w)) \nabla V(S_t, w)
= \alpha \delta_t x(S_t)

where \delta_t = R_{t+1} + \gamma V(S_{t+1}, w) - V(S_t, w) is the TD error.
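A sketch of semi-gradient TD(0) on a tiny two-state chain (the chain, features, and rewards are invented for illustration): state A transitions to state B with reward 0, and B terminates with reward 1; with γ = 0.9 the true values are V(B) = 1 and V(A) = 0.9.

```python
import numpy as np

def td0_update(w, x_t, r_next, x_next, alpha=0.1, gamma=0.9, done=False):
    """Semi-gradient TD(0): w += alpha * delta * x(S_t)."""
    v_next = 0.0 if done else x_next @ w
    delta = r_next + gamma * v_next - x_t @ w  # TD error
    return w + alpha * delta * x_t

x_A, x_B = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w = np.zeros(2)
for _ in range(500):
    w = td0_update(w, x_A, 0.0, x_B)             # A -> B, reward 0
    w = td0_update(w, x_B, 1.0, x_B, done=True)  # B -> terminal, reward 1
# w approaches the true values [0.9, 1.0]
```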

TD(λ) Approximation

The TD(λ) target is the λ-return G_t^{\lambda}, which is also a biased estimate of the true return. The training data pairs each state with its λ-return:

\langle S_1, G_1^{\lambda} \rangle, \langle S_2, G_2^{\lambda} \rangle, \langle S_3, G_3^{\lambda} \rangle, \dots

Forward-view linear TD(λ) update rule:

\Delta w = \alpha(G_t^{\lambda} - V(S_t, w)) \nabla V(S_t, w)

Backward-view linear TD(λ) update rule:

\delta_t = R_{t+1} + \gamma V(S_{t+1}, w) - V(S_t, w)
E_t = \gamma \lambda E_{t-1} + x(S_t)
\Delta w = \alpha \delta_t E_t

where E_t is the eligibility trace.
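The backward view can be sketched on the same kind of invented two-state chain as before (A → B with reward 0, B → terminal with reward 1, γ = 0.9), with the trace reset at the start of each episode:

```python
import numpy as np

def td_lambda_step(w, e, x_t, r_next, x_next,
                   alpha=0.1, gamma=0.9, lam=0.8, done=False):
    """Backward-view TD(lambda) with an accumulating eligibility trace."""
    v_next = 0.0 if done else x_next @ w
    delta = r_next + gamma * v_next - x_t @ w  # TD error
    e = gamma * lam * e + x_t                  # eligibility trace E_t
    return w + alpha * delta * e, e

x_A, x_B = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w = np.zeros(2)
for _ in range(500):
    e = np.zeros(2)                            # reset the trace each episode
    w, e = td_lambda_step(w, e, x_A, 0.0, x_B)
    w, e = td_lambda_step(w, e, x_B, 1.0, x_B, done=True)
# w approaches the true values [0.9, 1.0]
```

With one-hot features this is effectively tabular, so the TD(λ) fixed point coincides with the true values.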


#MMI706 - Reinforcement Learning at METU